Whole Genome Sequence Dataset of Mycobacterium tuberculosis Strains from Patients of Campania Region

Tuberculosis (TB) is one of the deadliest infectious diseases in the world. To manage TB effectively, an essential step is to gain insight into the lineages of Mycobacterium tuberculosis (MTB) and the distribution of drug resistance. Although the Campania region is declared a cluster area for the infection, TB evolution and transmission there remain poorly understood. To contribute to the effort to understand them, we generated a whole genome sequencing dataset of 159 MTB strains collected in the Campania region during 2018–2021. The results show that the most frequent MTB lineage is lineage 4, accounting for 129 strains (81.13%). Regarding drug resistance, 139 strains (87.42%) were classified as susceptible to all drugs, while the remaining 20 (12.58%) showed drug resistance. Among the drug-resistant strains, 8 were isoniazid-resistant MTB, 4 were multidrug-resistant MTB, and only one was classified as pre-extensively drug-resistant MTB. This dataset expands the available knowledge on drug resistance and evolution of MTB, contributing to further TB-related genomics studies to improve the management of this disease.


Background & Summary
Tuberculosis (TB) is a major global health threat that affects millions of people worldwide and has significant social and economic impacts 1. In 2021, an estimated 10.6 million people were affected by TB, and 1.6 million deaths were reported globally (https://www.who.int/news-room/fact-sheets/detail/tuberculosis). The direct healthcare costs of TB are also high, with an average of $567,708 per TB case 2. In Italy alone, 2146 cases were reported 3.

TB is caused by the Mycobacterium tuberculosis complex (MTBC) 4, comprising different lineages with varying geographical distributions and spread 5. MTBC includes Mycobacterium tuberculosis (MTB) sensu stricto, which includes lineages 1, 2, 3, 4, 7, and 8, and MTB var. africanum, comprising lineages 5, 6, and 9 6. Lineage 1 is widespread in East Africa, while lineage 2 is highly mobile, spreading in Asia, Africa, and Europe 7. Lineage 3 is mainly located in southern Asia and northern and eastern Africa, while lineage 4 is common in Europe and southern Africa 8. Lineages 5, 6, and 7 are endemic in West Africa and Ethiopia 9. In recent years, new lineages 8 and 9 have been detected in central and east Africa, respectively 10.

Several lines of evidence indicate that lineages differ in transmission, in progression and severity of the disease caused, in the efficacy of vaccines, diagnostics and drugs, and in drug resistance 11. Indeed, lineages 5 and 6 are closely associated with extrapulmonary infections, while lineage 4 is more related to pulmonary manifestations 12. Several studies have highlighted differences in the immunological recovery of patients infected with lineage 6 MTB compared with those infected with lineage 4 MTB 13. In contrast to the other lineages, lineage 6 responds more slowly to treatment with first-line drugs 14. It also grows more slowly in vitro and is more often associated with false-negative cultures 9. Lineages 3, 4, and 5 have more virulence factors than lineage 7 15. Moreover, lineages 2 and 3 have a strong propensity to acquire gene determinants of drug resistance 16.

Multidrug-resistant tuberculosis (MDR-TB) poses a serious threat to public health. In 2021, approximately 450,000 MDR-TB cases occurred, resulting in 191,000 deaths worldwide. Standard first-line treatment is largely ineffective for MDR-TB patients. Moreover, only about 1 in 3 MDR-TB patients had access to appropriate treatment in 2021 17. Monitoring the spread of different lineages and drug-resistant strains is crucial to improve TB control. The GeneXpert MTB/RIF test is widely used in most hospitals in the country; however, this assay cannot discriminate lineages and detects only rifampicin resistance 18. Whole genome sequencing (WGS) has become an essential tool to acquire comprehensive genetic information on TB strains, leading to improved disease control and containment of its global health impact 19. The characterization of genetic diversity in locally detected MTBC strains through WGS is crucial for understanding the transmission and evolution of TB drug resistance in Italy.

In our study, we aimed to provide a comprehensive dataset of MTBC-positive individuals, which would enable further investigation of the impact of MTBC infection on the population. To achieve this, we sequenced and analysed the genomes of 159 MTB isolates obtained from patients in the Campania region during 2018-2021. Our analysis focused on genetic diversity and on the identification of variants associated with drug resistance. The study design and data collection process are illustrated in Fig. 1. Notably, through WGS analysis, we identified drug response and resistance (Fig. 2a) as well as several lineages spread across the four participating hospitals (Fig. 2b). This approach also allowed us to observe the distribution of region-specific MTB variants, which can contribute to infection monitoring efforts.

The proposed WGS dataset provides valuable insights into the biological impact of MTB distribution. Researchers can analyse this dataset in conjunction with others to identify key lineages and significant gene mutations associated with drug resistance. Furthermore, the dataset includes various clinical factors (such as patients' provenance, starting biological material, and antibiogram assay results) that can be used to investigate the relationship between these factors and MTB infection. The identification of a diverse and heterogeneous population of MTB lineages, along with the presence of antibiotic resistance, offers researchers a wealth of data for versatile studies. These data are readily accessible to facilitate correlation studies between phenotypic and genotypic data, enabling the identification of drug-resistance mutations and markers associated with disease progression. Overall, our data could complement existing studies, improving TB management.

Methods
Sample cohort. This study involved 159 MTBC isolates collected retrospectively between 2018 and 2021 from four hospitals in the Campania region (Southern Italy). In detail, 41 strains came from Ascalesi hospital of Naples (Via Egiziaca A. Forcella, 31 - 80144 Napoli (NA)), 66 isolates from Cotugno hospital (Via Quagliariello, 54 - 80131 Napoli (NA)), 23 strains from AORN S.G. Moscati (Contrada Amoretta - 83100 Avellino (AV)), and 29 isolates from AO Sant'Anna e San Sebastiano of Caserta (Via Ferdinando Palasciano - 81100 Caserta (CE)). Most of the isolates belonged to patients from European countries (n = 98) and African countries (n = 44); only six and five subjects came from Asia and South America, respectively (for six cases no information was available). MTB strains were isolated from both pulmonary (n = 140) and extrapulmonary (n = 19) sites. All these data are summarised in Tables 1-4.
Ethical considerations. The study protocol was reviewed by the ethics committee of the participating hospitals. Following a thorough assessment, the committee determined that the samples under investigation were not human, obviating the need for ethical approval for the study. Furthermore, the bacterial strains used in this research were anonymized by assigning each a unique, encrypted identification code to safeguard the privacy of the subjects involved. Consequently, obtaining informed consent from patients was deemed unnecessary.

MTB isolates cultivation and antibiogram.
The bacterial clinical isolates examined in this study were obtained from patients as part of routine diagnostic requests conducted by collaborating hospitals. Biological samples were processed using the standard N-acetyl-L-cysteine-sodium hydroxide (NALC-NaOH) method for digestion, decontamination, and concentration of bacterial load. The pellet was resuspended in approximately 2 mL of phosphate buffer (pH 7) and mixed thoroughly. The suspension was used for microscopic analysis and for setting up bacterial cultures. Volumes of 250 and 500 μL of the suspension were inoculated in BACTEC MGIT 960 (Becton Dickinson, Franklin Lakes, NJ, USA) and on Löwenstein-Jensen (LJ) solid culture medium. Identification of mycobacterial species was performed by IS6110-based PCR. Five first-line drugs, namely streptomycin (STM, 1.0 μg/mL), isoniazid (INH, 0.1 μg/mL), rifampicin (RIF, 1.0 μg/mL), pyrazinamide (PRZ, 100 μg/mL) and ethambutol (EMB, 5 μg/mL), were tested using the Mycobacterium Growth Indicator Tube 960 (MGIT 960) system 20. After identification and susceptibility determination, bacterial stocks were prepared, catalogued, and stored at [...]

Genomic DNA extraction and whole-genome sequencing.
Genomic DNA extractions were performed at the hospitals where the strains were isolated. Genomic DNA was obtained from clinical strains using two column-based DNA extraction methods (QIAamp DNA minikit, Qiagen, Germany), according to the manufacturer's instructions. Total DNA concentration was determined using the Quant-iT DNA Assay Kit and a Qubit Fluorometer (Life Technologies, Monza, Italy), and its purity was determined using a NanoDrop ND-2000 spectrophotometer (Thermo Fisher Scientific) by evaluating the absorbance ratios A260/A280 and A260/A230. Library preparation and sequencing were carried out at the Laboratory of Molecular Medicine and Genomics, a research lab of the University of Salerno (Italy). Indexed libraries were prepared starting from 60 ng of each DNA sample according to the Illumina DNA Prep sample preparation kit protocol (Illumina Inc., San Diego, CA, USA). Final library concentrations and sizes were assessed with the Quant-iT DNA Assay Kit and the Agilent 4200 TapeStation System (Agilent Technologies, Milan, Italy), respectively. Then, the 159 libraries were pooled equimolarly, diluted to a final concentration of 1.3 pM and sequenced in paired-end mode (300 bp) on the Illumina NextSeq 500 platform (Illumina, San Diego, CA).
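The equimolar pooling step can be illustrated with a short calculation. The following is a minimal sketch, not the authors' procedure: it assumes the common approximation of 660 g/mol per base pair of double-stranded DNA, and the function names and example values are hypothetical.

```python
def library_molarity_nm(conc_ng_per_ul: float, mean_fragment_bp: float) -> float:
    """Convert a dsDNA library concentration (ng/uL) to molarity (nM).

    Assumes an average mass of 660 g/mol per base pair.
    """
    return conc_ng_per_ul * 1e6 / (660.0 * mean_fragment_bp)


def equimolar_volumes_ul(libraries: dict, fmol_per_library: float) -> dict:
    """Volume (uL) of each library that contributes the same number of femtomoles.

    `libraries` maps a library name to (concentration in ng/uL, mean size in bp).
    """
    volumes = {}
    for name, (conc, size_bp) in libraries.items():
        molarity = library_molarity_nm(conc, size_bp)  # 1 nM equals 1 fmol/uL
        volumes[name] = fmol_per_library / molarity
    return volumes


# Hypothetical example values, not measurements from this study.
example = {"lib_A": (6.6, 500.0), "lib_B": (3.3, 500.0)}
volumes = equimolar_volumes_ul(example, fmol_per_library=20.0)
for name, vol in volumes.items():
    print(f"{name}: {vol:.2f} uL")
```

In this sketch the more dilute library contributes a proportionally larger volume, so that every library adds the same number of molecules to the pool before the final dilution.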

Genomic analysis.
The sequenced reads were quality checked with FastQC v0.11.3 21. Low-quality and adapter-related fragments were removed using cutadapt v1.18 with the following parameters: -m 20 (minimum read length); -q 20 (quality-trimming cutoff) 22. The high-quality reads were then imported into TB-profiler 23 with the --min_depth option set to 50, to retain only mutations supported by at least 50 reads. TB-profiler (v4.2.0) analysis allowed lineage assignment and prediction of antimicrobial susceptibility (drug resistance) against the MTB H37Rv reference genome (GenBank accession: GCA_000195955.2); resistance was predicted using the curated database provided with the TB-profiler software. This database has been tested using over 17,000 samples with genotypic and phenotypic MTBC data. The phylogenetic trees were inferred from whole-genome single nucleotide polymorphisms, as proposed by Senghore et al., using TB-profiler intermediate alignment files 24. In detail, .bam files were processed with freebayes v1.3.5 to call variants 25 using the following parameters: -p 1, --min-coverage 5, -q 20. Then, variant calling files (.vcf) were processed with snippy v3.1 26 to filter out non-significant mutations using the snippy-vcf_filter function, with the --minfrac and --minqual parameters set to 0.1 and 20, respectively. The bcftools software (version 1.12) 27 was used to generate consensus sequences for each isolate, which were given as input to JolyTree (v. [...]) to compute a fast distance-based phylogenetic inference from unaligned genome sequences. Finally, JolyTree output files (.newick) were used as input to the iTOL online software 29 for tree construction and visualisation.
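The first steps of this workflow can be sketched as command lines assembled in Python. This is an illustrative sketch only, not the authors' scripts: the file names, the reference FASTA name and the `bam/<sample>.bam` location are assumptions, no adapter sequence is passed to cutadapt because none is reported, and only the parameter values stated in the text are used. The commands are built as argument lists and printed, not executed.

```python
import shlex


def build_commands(sample, r1, r2, reference="H37Rv.fasta"):
    """Return trimming, profiling and variant-calling commands as argument lists."""
    trimmed_r1 = f"{sample}_R1.trimmed.fastq.gz"  # hypothetical output names
    trimmed_r2 = f"{sample}_R2.trimmed.fastq.gz"
    bam = f"bam/{sample}.bam"  # assumed location of the TB-profiler alignment

    return {
        # Quality trimming (-q 20) and minimum read length (-m 20), paired-end.
        "cutadapt": ["cutadapt", "-m", "20", "-q", "20",
                     "-o", trimmed_r1, "-p", trimmed_r2, r1, r2],
        # Lineage and resistance profiling; variants need at least 50 reads.
        "tb-profiler": ["tb-profiler", "profile",
                        "-1", trimmed_r1, "-2", trimmed_r2,
                        "-p", sample, "--min_depth", "50"],
        # Haploid variant calling; freebayes writes the VCF to standard output.
        "freebayes": ["freebayes", "-f", reference,
                      "-p", "1", "--min-coverage", "5", "-q", "20", bam],
    }


commands = build_commands("sample01", "sample01_R1.fastq.gz", "sample01_R2.fastq.gz")
for name, argv in commands.items():
    print(f"{name}: {shlex.join(argv)}")
```

Keeping each command as an argument list makes it straightforward to pass to `subprocess.run` later without shell quoting issues.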
and rapid method to reveal the presence of the mycobacterium and to confirm MTB-infected patients, highlighting the diverse distribution of both sensitive and drug-resistant isolates within each hospital participating in the study. Supplementary File 1 and Fig. 2 provide an evaluation of the quality of the WGS data generated, including the mapping percentage on the MTB reference genome, lineage distribution and resistance profiles.
Starting from the extracted genomic DNA, whole genome sequencing was performed. The sequencing of 159 MTB isolates produced a total of 424,760,352 reads (range 1,392,456-5,035,024 reads), corresponding to an average of 2,671,448.75 reads per sample. After filtering of low-quality reads and adapter removal, 424,511,352 reads (2,669,882.72 reads per sample, range 1,391,848-5,032,296 reads) remained for downstream analysis. Our results showed an overall high mapping rate: about 98.24% of high-quality reads per sample aligned to the established MTB reference genome (H37Rv), with a median coverage value of 66.9 per sample. The in silico analysis of the 159 isolates detected eight different lineage assignments: lineage1 (n = 4), lineage2 (n = 5), lineage3 (n = 12), lineage4 (n = 129), lineage5 (n = 1), lineage6 (n = 5), lineage9 (n = 1) and lineage2_lineage4 (n = 2). Among all hospitals enrolled, the dominant lineage was the Euro-American lineage 4 (n = 129, 81.13%), with 32 isolates identified at Ascalesi Hospital, 61 at Cotugno Hospital, 15 at Moscati Hospital and 21 at Sant'Anna e San Sebastiano Hospital. All lineage assignments are summarized in Fig. 2b. In the context of antibiotic resistance, approximately 12.6% of samples (20/159) were non-susceptible, while 139 MTB isolates were classified as antibiotic susceptible.
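The summary figures above can be re-derived from the reported counts. The following short check uses only numbers stated in the text; the variable names are illustrative.

```python
# Counts as reported in the text.
lineage_counts = {
    "lineage1": 4, "lineage2": 5, "lineage3": 12, "lineage4": 129,
    "lineage5": 1, "lineage6": 5, "lineage9": 1, "lineage2_lineage4": 2,
}
lineage4_by_hospital = {
    "Ascalesi": 32, "Cotugno": 61, "Moscati": 15,
    "Sant'Anna e San Sebastiano": 21,
}
raw_reads = 424_760_352
filtered_reads = 424_511_352

n_isolates = sum(lineage_counts.values())
lineage4_pct = round(100 * lineage_counts["lineage4"] / n_isolates, 2)
mean_raw = round(raw_reads / n_isolates, 2)
mean_filtered = round(filtered_reads / n_isolates, 2)
removed_pct = 100 * (raw_reads - filtered_reads) / raw_reads

print(f"isolates: {n_isolates}")
print(f"lineage 4 share: {lineage4_pct}%")
print(f"mean reads per sample: {mean_raw} raw, {mean_filtered} after filtering")
print(f"reads removed by filtering: {removed_pct:.3f}%")
```

The per-hospital lineage 4 counts can be summed in the same way to confirm that they add up to the 129 isolates reported overall.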
In detail, eight, four and one isolates were classified as HR-TB, MDR-TB and Pre-XDR-TB, respectively, while seven samples were resistant to only one drug, as reported in Fig. 2a. The antibiogram results were in agreement with the drug resistance analysis by WGS, showing a high percentage of correspondence between in silico prediction and antibiotic tests. In fact, approximately 93.3% of the drug tests performed (653/700 tests) agreed with the WGS prediction. Despite the limited number of MDR-MTB isolates relative to the overall sample size, the integration of this dataset with comparable ones, accompanied by rigorous data validation procedures, can improve the statistical power of the analysis. In addition, this integration could significantly contribute to the advancement of new therapies and diagnostic tools in the ongoing battle against TB.
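The resistance categories used here can be expressed as a simple rule set. The sketch below is an assumption, not the classification code used in the study: it follows the WHO definitions (HR-TB as isoniazid-resistant and rifampicin-susceptible, MDR-TB as resistant to both isoniazid and rifampicin, Pre-XDR-TB as rifampicin-resistant plus resistant to a fluoroquinolone) and places every other resistance pattern in "Other". The drug names and the `classify` function are hypothetical.

```python
FLUOROQUINOLONES = {"levofloxacin", "moxifloxacin"}


def classify(resistant: set) -> str:
    """Assign a resistance category from the set of drugs predicted resistant."""
    if not resistant:
        return "Sensitive"
    inh = "isoniazid" in resistant
    rif = "rifampicin" in resistant
    fq = bool(resistant & FLUOROQUINOLONES)
    if rif and fq:
        return "Pre-XDR-TB"
    if inh and rif:
        return "MDR-TB"
    if inh:
        return "HR-TB"
    return "Other"


# Category counts and concordance as reported in the text.
reported = {"Sensitive": 139, "HR-TB": 8, "Other": 7, "MDR-TB": 4, "Pre-XDR-TB": 1}
resistant_total = sum(n for cat, n in reported.items() if cat != "Sensitive")
concordance_pct = round(100 * 653 / 700, 1)

print(classify({"isoniazid", "rifampicin"}))
print(f"resistant isolates: {resistant_total} of {sum(reported.values())}")
print(f"phenotype/genotype concordance: {concordance_pct}%")
```

The order of the checks matters: the Pre-XDR-TB rule is tested before the MDR-TB rule because every Pre-XDR-TB isolate under this definition would otherwise also satisfy a less specific category.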

Fig. 1 Study workflow. Data collection and procedure pipeline are shown.

Fig. 2 In silico profiling of MTB isolates with lineage assignment and drug-resistance analysis results. (a) Circular tree reporting the in silico prediction of resistance to the tested antibiotics and the phylogenetic distance among the 159 MTB isolates. The 159 MTB isolates were classified as sensitive (green) (n = 139), HR-TB (light purple) (n = 8), Other (blue) (n = 7), MDR-TB (orange) (n = 4) and Pre-XDR-TB (red) (n = 1). (b) Histogram plot showing the distribution of all lineages among the four hospitals enrolled in this study.

Table 1.
Origin and phenotypic antibiotic resistance profile of the M. tuberculosis strains isolated at AORN S.G.

Table 3.
Origin and phenotypic antibiotic resistance profile of the M. tuberculosis strains isolated at AO Sant'Anna e San Sebastiano (N.R., not received; N.T., not tested).